Assignment on RPubs
Rmd on GitHub

Introduction

These are the libraries we will be using.

library(tm)
## Loading required package: NLP
library(SnowballC)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(stringr)
library(RWeka)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(tidytext)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v tibble  2.1.3     v purrr   0.3.3
## v tidyr   1.0.2     v dplyr   0.8.5
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()

Approach

We discussed which data sets to use. The internet offers a plethora of data, but few sets were applicable to our question, and LinkedIn was not readily accessible.

Importing

The data we ended up using comes from https://data.world/jobspikr/10000-data-scientist-job-postings-from-the-usa. Phil was able to obtain the CSV, which was scraped using JobsPikr. It consists of 10,000 records with the following fields: crawl_timestamp, url, job_title, category, company_name, city, state, country, inferred_city, inferred_state, inferred_country, post_date, job_description, job_type, salary_offered, job_board, geo, cursor, contact_email, contact_phone_number, uniq_id, and html_job_desc.

This is the original CSV with field names.

CSV Fieldnames

url <- "https://github.com/logicalschema/DATA607/raw/master/project3/data/data_scientist_united_states_job_postings_jobspikr.csv.gz"


jobs <- read_csv(url)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   post_date = col_date(format = ""),
##   salary_offered = col_logical(),
##   cursor = col_double(),
##   contact_email = col_logical(),
##   html_job_description = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 3 parsing failures.
##  row             col                     expected                          actual                                                                                                                          file
## 2070 salary_offered  1/0/T/F/TRUE/FALSE           Salary Range:    Undisclosed 'https://github.com/logicalschema/DATA607/raw/master/project3/data/data_scientist_united_states_job_postings_jobspikr.csv.gz'
## 2499 job_description closing quote at end of file                                 'https://github.com/logicalschema/DATA607/raw/master/project3/data/data_scientist_united_states_job_postings_jobspikr.csv.gz'
## 2499 NA              22 columns                   13 columns                      'https://github.com/logicalschema/DATA607/raw/master/project3/data/data_scientist_united_states_job_postings_jobspikr.csv.gz'
jobs
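The three parsing failures above arise because `read_csv` guesses column types from the first rows, so `salary_offered` (which contains free text such as "Salary Range: Undisclosed") is mis-guessed as logical. A minimal sketch of a re-read with an explicit column specification; the column names come from the spec printed above, and treating the failing columns as character is an assumption:

```r
# Re-read, forcing the mis-guessed columns to character.
# Assumes the same `url` object defined above.
jobs <- read_csv(
  url,
  col_types = cols(
    .default       = col_character(),
    post_date      = col_date(format = ""),
    salary_offered = col_character(),  # free text, not TRUE/FALSE
    cursor         = col_double()
  )
)
```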

Initial problems:

  1. The CSV file is 48.5 MB and contains UTF-8 characters.

  2. HTML code was left in the job_description field.

  3. The initial load of the CSV into R could take time depending on the hardware of a group member's system.

  4. How would we convert the data to a normalized database?

  5. Group members' machines had different amounts of RAM: Vanita's laptop had 8 GB while Sung's had 16 GB.
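For problem 2, one way to strip leftover HTML from `job_description` is a simple regex pass with `stringr` (already loaded above). This is a rough sketch, not necessarily the method the group used, and a regex will not handle every malformed tag:

```r
# Remove HTML tags and common entities, then collapse whitespace.
strip_html <- function(x) {
  x %>%
    str_replace_all("<[^>]+>", " ") %>%        # drop tags
    str_replace_all("&nbsp;|&amp;", " ") %>%   # common entities
    str_squish()                               # collapse runs of whitespace
}

# jobs$job_description <- strip_html(jobs$job_description)
```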


Excursus

This is a brief description of some of the steps taken to help alleviate the problems of importing the information for use in our project.

Infrastructure

Azure, MySQL, and Cognitive Text Analytics: Sung wanted to experiment with cloud computing, so he used his SPS email to create a trial Azure account (each trial account is given a $200 credit). He created a MySQL database instance to host the databases used.

MySQL database access: The connection is public but restricted by IP address. If you would like access, please email one of the team members.

Server: data607.mysql.database.azure.com
Port: 3306
Username:
Password: student#2020

Default timeout settings for the database server had to be lengthened to enable longer processing.
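Connecting to the instance from R might look like the following sketch using DBI and RMariaDB. The username placeholder and database name are hypothetical (the username is intentionally not published above), and Azure-specific settings such as SSL are not shown:

```r
library(DBI)

# Hypothetical connection sketch; substitute your own credentials.
con <- dbConnect(
  RMariaDB::MariaDB(),
  host     = "data607.mysql.database.azure.com",
  port     = 3306,
  username = "your_username",   # placeholder -- email a team member for access
  password = "student#2020",
  dbname   = "data607"          # assumed database name
)

jobs_db <- dbReadTable(con, "jobs")  # assumed table name
dbDisconnect(con)
```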


This is a link to the mysqldump of the database.

Diagram of Database: E-R Diagram

Tools

Microsoft's Cognitive Text Analytics, Power BI, Slack, GitHub, and MySQL Workbench: In addition to the database instance he created, Sung created an instance of Azure's Cognitive Text Analytics to see what information Microsoft's AI could mine from our data. These tools were used to facilitate handling the data.

Power BI is a Microsoft product for visualizing data. It was used with Microsoft's Cognitive Text Analytics to extract key phrases from the job_description field and to create a simple word cloud to compare with our later analysis.

Data Management

To curtail the time needed to process the 10,000-row CSV, group members tested code against a small subset of the file before running it on the full data set. Even so, some group members' machines raised memory errors.
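The subset-first workflow can be sketched as follows; `sample_n()` matches the dplyr version shown in the session output above, and the sample size is arbitrary:

```r
set.seed(607)  # reproducible sample
jobs_small <- jobs %>% sample_n(500)

# Develop and test text-mining code against the small corpus first,
# then swap in the full data set once the code works.
corpus <- VCorpus(VectorSource(jobs_small$job_description))
```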

Files being worked on from Github were compressed to enable quicker transport across the network.
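readr handles gzip transparently in both directions, so the compressed files on GitHub need no separate decompression step. A small sketch (file names are illustrative):

```r
# write_csv compresses automatically when the path ends in .gz
write_csv(jobs, "data/jobs_subset.csv.gz")

# read_csv likewise decompresses .gz transparently, as with the URL above
jobs2 <- read_csv("data/jobs_subset.csv.gz")
```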

This is an example of a memory error on Sung’s laptop when trying to text mine on the job description data: Memory Error

Tidying

Analysis

Sections

Section 1

Section 2

Section 3

Conclusions